Prepare mktables for Unicode 15.1 and 16.0 #23133
if (defined (my $bmg = property_ref('Bidi_Mirroring_Glyph'))) {
    $bmg->set_to_output_map($EXTERNAL_MAP);
    $bmg->set_range_size_1(1);
}

property_ref('Numeric_Value')->set_to_output_map($OUTPUT_ADJUSTED);

# These two properties have no short names and the file names for them
# clash in DOS 8.3. Work around this by creating shorter file names that
Where are we still limited by 8.3?
On IRC the other day, I asked if we were still limited, and the answer was yes.
For Unicode filenames yes, but for ASCII filenames we aren't, AFAIK.
I'd prefer to leave this as-is, since it is trivial to do, just in case. And I have WIP which should get rid of them altogether.
4894f2a to 1f07a91
The commit message for aa6faba has 2 misspellings. "infrastructue" lacks the second "r". In "incoroporated" the second "o" needs removal.
1f07a91 to de01c61
This p.r. for Unicode mktables did not make it into the March 20 dev release. Does that mean we have to defer it to the 5.43 dev cycle?
The change isn't really user visible; it would only affect people who would want to patch in a more recent Unicode version.
@khwilliamson there's one unresolved conversation in this p.r. If you mark that resolved, then I think this is okay to merge.
There are more commits coming
Add comments, and rewrap comment lines to fit 80 columns
Unicode 15.1 introduces this new property, which needs the same special handling as plain NFKC_Casefold does.
These files are changed in 15.1 to have @missing lines, whereas they didn't before. This leads to some warning messages, so turn off looking at them, as we do for a number of other files.
We handle it by ignoring this file, new to Unicode 16.0. It consists of lists of characters that, to put it less delicately than Unicode would like, they regret creating. But there are no rules associated with them. It would be nice to have a \p{DoNotEmit} property so that applications could handle situations where this occurs. But I'm fearful that if we did something like this, Unicode would later come up with something that had the same intention but would be subtly or unsubtly different. That has happened before, to our detriment. So I think we should wait to see what they actually do in future releases.
de01c61 to 8f58648
8f58648 to 5b52ed1
This includes several new properties, some of which are considered "provisional" by Unicode, which means they can be heavily revised or withdrawn. These properties are designed for use by scholars of hieroglyphics.
These new properties are automatically handled, but there is a problem. They have no short form names. Files are written for them based on their names, and those files are not distinguishable on a DOS 8.3 file system. The solution here is to manually override the automatically generated file names with distinguishable ones.
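The clash described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the truncation rule in to_8_3_base() is a simplified stand-in and not the rule mktables actually uses, and the property names and override names are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an 8.3 base name: lowercase, drop anything that
# is not a letter or digit, keep the first 8 characters.
# This is NOT mktables' actual file-naming algorithm.
sub to_8_3_base {
    my ($name) = @_;
    (my $base = lc $name) =~ s/[^a-z0-9]//g;
    return substr($base, 0, 8);
}

# Return the groups of names that end up with the same base name.
# %$override maps a property name to a manually chosen base name.
sub find_clashes {
    my ($names, $override) = @_;
    my %by_base;
    for my $name (@$names) {
        my $base = $override->{$name} // to_8_3_base($name);
        push @{ $by_base{$base} }, $name;
    }
    return grep { @$_ > 1 } values %by_base;
}

# Made-up names, chosen only because they share a long common prefix.
my @names = qw(Example_Long_Name_Alpha Example_Long_Name_Beta);

my @before = find_clashes(\@names, {});
my @after  = find_clashes(\@names, {
    Example_Long_Name_Alpha => 'exalpha',
    Example_Long_Name_Beta  => 'exbeta',
});

printf "clashing groups without overrides: %d\n", scalar @before;
printf "clashing groups with overrides:    %d\n", scalar @after;
```

Both made-up names truncate to the same base, so they clash until each is given an explicit shorter name.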
mktables does a lot of sanity checks on the data it gets fed. One of those is to make sure any \d group of code points is 10 long. This verifies that Unicode has given us enough code points to form 0-9. It assumes that if it got this much right, their numeric values are also 0-9. This check has uncovered issues with the Unicode Standard in the past. Nowadays, they've cleaned up their act, and it's been many releases since there have been problems. But our checks remain, and I think they should. What happened in Unicode 16.0 is that there is a range of \d characters that contains two consecutive groups of 0-9 values. The check could be changed to verify that the count is divisible by 10, but checking for this particular range is a bit safer.
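A rough sketch of that kind of check, assuming a sorted list of digit code points as input. The helper names are hypothetical and this is not the code in mktables; the 20-long run below uses private-use code points purely as a placeholder, not the actual Unicode 16.0 range.

```perl
use strict;
use warnings;

# Split a sorted list of code points into runs of consecutive values.
sub contiguous_runs {
    my @cps = @_;
    my @runs;
    for my $cp (@cps) {
        if (@runs && $cp == $runs[-1][-1] + 1) {
            push @{ $runs[-1] }, $cp;
        }
        else {
            push @runs, [$cp];
        }
    }
    return @runs;
}

# Return the runs that fail the check. Every run must be exactly 10 long,
# unless %$allowed lists its starting code point with another length.
sub bad_digit_runs {
    my ($cps, $allowed) = @_;
    my @bad;
    for my $run (contiguous_runs(@$cps)) {
        my $expected = $allowed->{ $run->[0] } // 10;
        push @bad, $run if @$run != $expected;
    }
    return @bad;
}

# ASCII digits plus a placeholder run holding two consecutive 0-9 groups.
my @digits = (0x30 .. 0x39, 0xF0000 .. 0xF0013);

my @strict   = bad_digit_runs(\@digits, {});
my @excepted = bad_digit_runs(\@digits, { 0xF0000 => 20 });

printf "failing runs, no exceptions:   %d\n", scalar @strict;
printf "failing runs, known exception: %d\n", scalar @excepted;
```

Listing the one known range keeps the check strict everywhere else, whereas a divisible-by-10 rule would also accept any future run of 20, 30, and so on without anyone looking at it.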
There is already this method for lists of Ranges, so this is just so callers don't need to know which they are operating on.
5b52ed1 to 32ee519
This has been repushed, with the new hieroglyphic properties now working
I think the PSC @haarg @ap @book should consider if we should ship Perl 5.42 without updating the Unicode version. We are now one major version and one dot release behind. The reason is solely that the break properties have been very difficult to update. The rest of the releases update smoothly. The break properties are what matches regular expression constructs
Technically, it is past the deadline for such changes in this development cycle. A program that depends on a particular code point being unassigned could fail when that code point does get assigned to be a specific character. And the new releases assign thousands of new characters. On the other hand, one could argue that such a program is incorrect, as it depends on the stability of something that is inherently unstable and documented as such. (There are some code points that aren't ever going to become characters, so you could use some of those. Or there are ones that Unicode would have to be pretty desperate to assign, such as the one in the position where a capital Greek Final Sigma would be. There is no such character, but they left a hole where it would have appeared so as not to mess up the symmetry of the rest of the Greek encoding.)
Unicode has changed the line breaking algorithm for some Indic characters. If you relied on the old algorithm your code would break, but on the other hand people would be mad at you for not giving them the results the language dictates. The break algorithms are declared to be unstable by Unicode.
I had hoped to get a PR ready by today, but I ran out of time, though I'm close. I do think we should make some effort to keep up with Unicode releases.
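For illustration, here is a small sketch of the kinds of code points discussed above. It assumes that the hole where a capital Greek Final Sigma would sit is U+03A2 and uses U+FDD0 as an example of a noncharacter; the helper names are made up. The results for unassigned code points depend on the Unicode version the running perl was built with, which is exactly the instability at issue.

```perl
use strict;
use warnings;

# True if the code point has General_Category=Unassigned (Cn) in the
# Unicode version this perl was built with.
sub is_unassigned {
    my ($cp) = @_;
    return chr($cp) =~ /\A\p{Unassigned}\z/ ? 1 : 0;
}

# True if the code point is a noncharacter, i.e. one of the code points
# that are never going to become characters.
sub is_nonchar {
    my ($cp) = @_;
    return chr($cp) =~ /\A\p{Noncharacter_Code_Point}\z/ ? 1 : 0;
}

for my $cp (0x3A2, 0x3A3, 0xFDD0) {
    printf "U+%04X unassigned=%d noncharacter=%d\n",
        $cp, is_unassigned($cp), is_nonchar($cp);
}
```

A program that needs a code point guaranteed never to be a character is on firmer ground testing for a noncharacter than for an unassigned one.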
perldelta not needed until the actual releases are incorporated.